# EDA: Red Wine Quality

Number of instances: red wine, 1599.

Input variables (based on physicochemical tests):

  1. fixed acidity (tartaric acid, g / dm^3)
  2. volatile acidity (acetic acid, g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride, g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate, g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  12. quality (score between 0 and 10)

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

summary(rw)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Check the data for missing values

sum(!is.na(rw)) == (1599 * 13)
## [1] TRUE
# All 1599 * 13 cells are non-missing, so there are no NA values in this dataset
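
As a hedged sketch of the missing-value check, the snippet below compares a few base R idioms on a small hypothetical data frame (`toy`, not the wine data). It assumes only base R. Note that `length()` of a logical matrix counts every cell regardless of its value, so it cannot reveal an `NA` on its own.

```r
# Hypothetical toy data frame with one missing value (not the wine data)
toy <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6))

# length() counts every cell of the logical matrix, TRUE and FALSE alike
n_cells <- length(!is.na(toy))

# sum() counts only the cells that are actually non-missing
n_present <- sum(!is.na(toy))

# any() answers the yes/no question directly
has_na <- any(is.na(toy))

# Number of missing values in each column
na_per_col <- colSums(is.na(toy))
```

On the wine data, the same calls could be applied to `rw` in place of `toy`.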

Univariate Plots Section

table(rw$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
# Plot quality counts
ggplot(aes(x = quality), data = rw) + 
  geom_histogram(binwidth = 0.5)

## Quality only takes the values 3, 4, 5, 6, 7 and 8, and is concentrated at 5 and 6
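
To put a number on how concentrated the scores are, a minimal sketch can turn the counts from the `table(rw$quality)` output into shares. The variable names here are illustrative only.

```r
# Counts copied from the table(rw$quality) output
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of each score among all wines
quality_share <- quality_counts / sum(quality_counts)

# Combined share of the two middle scores, 5 and 6
mid_share <- sum(quality_share[c("5", "6")])

# Shares as percentages, rounded to one decimal
round(100 * quality_share, 1)
```

On the data frame itself, `prop.table(table(rw$quality))` should give the same shares.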
table(rw$alcohol)
## 
##              8.4              8.5              8.7              8.8 
##                2                1                2                2 
##                9             9.05              9.1              9.2 
##               30                1               23               72 
## 9.23333333333333             9.25              9.3              9.4 
##                1                1               59              103 
##              9.5             9.55 9.56666666666667              9.6 
##              139                2                1               59 
##              9.7              9.8              9.9             9.95 
##               54               78               49                1 
##               10 10.0333333333333             10.1             10.2 
##               67                2               47               46 
##             10.3             10.4             10.5            10.55 
##               33               41               67                2 
##             10.6             10.7            10.75             10.8 
##               28               27                1               42 
##             10.9               11 11.0666666666667             11.1 
##               49               59                1               27 
##             11.2             11.3             11.4             11.5 
##               36               32               32               30 
##             11.6             11.7             11.8             11.9 
##               15               23               29               20 
##            11.95               12             12.1             12.2 
##                1               21               13               12 
##             12.3             12.4             12.5             12.6 
##               12               13               21                6 
##             12.7             12.8             12.9               13 
##                9               17                9                6 
##             13.1             13.2             13.3             13.4 
##                2                1                3                3 
##             13.5 13.5666666666667             13.6               14 
##                1                1                4                7 
##             14.9 
##                1
# Plot alcohol counts
ggplot(aes(x = alcohol), data = rw) + 
  geom_histogram(binwidth = 0.1) + 
  coord_cartesian(xlim = c(9, 13))

table(rw$pH)
## 
## 2.74 2.86 2.87 2.88 2.89  2.9 2.92 2.93 2.94 2.95 2.98 2.99    3 3.01 3.02 
##    1    1    1    2    4    1    4    3    4    1    5    2    6    5    8 
## 3.03 3.04 3.05 3.06 3.07 3.08 3.09  3.1 3.11 3.12 3.13 3.14 3.15 3.16 3.17 
##    6   10    8   10   11   11   11   19    9   20   13   21   34   36   27 
## 3.18 3.19  3.2 3.21 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29  3.3 3.31 3.32 
##   30   25   39   36   39   32   29   26   53   35   42   46   57   39   45 
## 3.33 3.34 3.35 3.36 3.37 3.38 3.39  3.4 3.41 3.42 3.43 3.44 3.45 3.46 3.47 
##   37   43   39   56   37   48   48   37   34   33   17   29   20   22   21 
## 3.48 3.49  3.5 3.51 3.52 3.53 3.54 3.55 3.56 3.57 3.58 3.59  3.6 3.61 3.62 
##   19   10   14   15   18   17   16    8   11   10   10    8    7    8    4 
## 3.63 3.66 3.67 3.68 3.69  3.7 3.71 3.72 3.74 3.75 3.78 3.85  3.9 4.01 
##    3    4    3    5    4    1    4    3    1    1    2    1    2    2
# Plot pH counts
ggplot(aes(x = pH), data = rw) + 
  geom_histogram(fill = 'red', binwidth = 0.01)

# Most values lie between 2.8 and 3.8
# Plot sulphates counts; this feature has a long tail, so the x axis uses a log10 scale
ggplot(aes(x = sulphates), data = rw) + 
  scale_x_log10() +
  geom_histogram(binwidth = 0.01)

## The plot above shows that this feature has a long tail, with most values in the interval [0.3, 1.0]
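
The effect of a log scale on a long-tailed feature can be illustrated with simulated data. This is only a sketch: the sample is a hypothetical log-normal draw, not the sulphates column, and `skewness()` is a small helper defined here rather than a base R function.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical right-skewed sample standing in for a long-tailed feature
x <- rlnorm(1000, meanlog = -0.5, sdlog = 0.5)

# Simple moment-based sample skewness (helper defined for this sketch)
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Skewness before and after the log10 transform
skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
```

A skewness much closer to zero after the transform is what makes the histogram on the log axis look more symmetric.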
# Plot density counts
ggplot(aes(x = density), data = rw) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# density looks approximately normally distributed
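
The `stat_bin()` message suggests choosing a `binwidth` explicitly. One common heuristic is the Freedman-Diaconis rule; the sketch below applies it to quartiles read from the `summary(rw)` output. This is a swapped-in heuristic rather than the approach used in this analysis, and `fd_binwidth()` is a hypothetical helper name.

```r
# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3)
# (hypothetical helper, not part of the original analysis)
fd_binwidth <- function(q1, q3, n) {
  2 * (q3 - q1) / n^(1 / 3)
}

# Quartiles and sample size for density, read from the summary(rw) output
bw_density <- fd_binwidth(q1 = 0.9956, q3 = 0.9978, n = 1599)

# Quartiles for total.sulfur.dioxide from the same summary
bw_tsd <- fd_binwidth(q1 = 22, q3 = 62, n = 1599)
```

The result could then be passed on, for example as `geom_histogram(binwidth = bw_density)`.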
# Plot total.sulfur.dioxide counts
ggplot(aes(x = total.sulfur.dioxide), data = rw) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Most total.sulfur.dioxide values fall between 0 and 80.
# Plot chlorides counts
ggplot(aes(x = chlorides), data = rw) + 
  geom_histogram(binwidth = 0.001) + 
  coord_cartesian(xlim = c(0.03, 0.14))

## Most values lie between 0.03 and 0.14, and the distribution looks roughly normal
# Plot fixed.acidity counts
ggplot(aes(x = fixed.acidity), data = rw) + 
  geom_histogram(binwidth = 0.1)

# Plot volatile.acidity counts
ggplot(aes(x = volatile.acidity), data = rw) + 
  geom_histogram(binwidth = 0.01)

# Plot residual.sugar counts
ggplot(aes(x = residual.sugar), data = rw) + 
  geom_histogram(binwidth = 0.1) +
  coord_cartesian(xlim = c(1.2, 3.2))

## Most residual.sugar values lie between 1.2 and 3.2, and the distribution looks roughly normal
ggplot(aes(x = citric.acid), data = rw) +
  xlab("citric acid")+
  geom_bar(colour = "black", fill = "#990066")

table(rw$citric.acid)
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1
## The most frequent values are 0, 0.49, 0.24 and 0.02
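
Rather than reading the most frequent values off the full table, they can be extracted by sorting it. The sketch below uses a short hypothetical vector, not the citric.acid column.

```r
# Hypothetical vector of measurements (not the wine data)
x <- c(0, 0, 0, 0.02, 0.02, 0.24, 0.49, 0.49, 0.49, 0.49, 0.1)

# Tabulate, sort by count (largest first), and keep the top three
top_values <- sort(table(x), decreasing = TRUE)[1:3]

# The values themselves are stored as the names of the table
names(top_values)
```

On the wine data, the same pattern would be `sort(table(rw$citric.acid), decreasing = TRUE)[1:4]`.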

Univariate Analysis

What is the structure of your dataset?

  1. There are 1599 observations in total, each with 11 input variables plus the quality score and a row index X, and there are no NA values.
  2. Some features may be related, such as free.sulfur.dioxide and total.sulfur.dioxide.
  3. Some features have long-tailed distributions, which should be cleaned up before further analysis.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. We can use the other features to estimate quality, and analyze which features affect red wine quality positively and which affect it negatively.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Likely candidates are residual.sugar, pH, sulphates, chlorides, density and alcohol.

Did you create any new variables from existing variables in the dataset?

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Apart from some long-tailed features, the data is tidy. The X column is only a row index and carries no other meaning. Some features also contain outliers that may need to be removed, for example:

ggplot(aes(x = free.sulfur.dioxide, y =  total.sulfur.dioxide), data = rw) + 
  geom_point(aes(color = quality))

# Two observations have total.sulfur.dioxide above 200; remove them as outliers
rw2 <- subset(rw, total.sulfur.dioxide < 200)

ggplot(aes(x = free.sulfur.dioxide, y =  total.sulfur.dioxide), data = rw2) + 
  geom_point(aes(color = quality))
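
The cutoff of 200 was chosen by eye. As a hedged alternative, Tukey's 1.5 * IQR rule derives fences from the quartiles in the `summary(rw)` output. This is a different heuristic from the one used above, and the `values` vector holds only a few illustrative numbers taken from the outputs.

```r
# Quartiles of total.sulfur.dioxide, read from the summary(rw) output
q1 <- 22
q3 <- 62

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles
# (an alternative heuristic, not the fixed cutoff used in this analysis)
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# A few illustrative values to classify against the fences
values <- c(18, 34, 102, 289)
is_outlier <- values < lower_fence | values > upper_fence
```

This rule is stricter than the cutoff of 200, so it would flag more rows. Whether that is desirable depends on how much of the long tail should be kept.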

Bivariate Plots Section

Reference: How to read a box plot / Introduction to box plots

## Plot the relationship between free.sulfur.dioxide and total.sulfur.dioxide
ggplot(aes(y = free.sulfur.dioxide, x = total.sulfur.dioxide), data = rw2) +
  geom_point(aes(color = quality), alpha = 1/2) +
  stat_smooth(method = 'lm')

## free.sulfur.dioxide is related to total.sulfur.dioxide, so only one of them (for example total.sulfur.dioxide) should be considered in further analysis.
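
The strength of such a relationship can be summarised with a Pearson correlation coefficient. As a minimal sketch, the snippet below uses only the first ten rows shown in the `str(rw)` output, so the value it produces is illustrative and not the correlation for the full dataset.

```r
# First ten rows of each column, copied from the str(rw) output
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Pearson correlation coefficient (cor() is in the stats package)
r <- cor(free_so2, total_so2)

# On the full data this would be:
# cor(rw2$free.sulfur.dioxide, rw2$total.sulfur.dioxide)
```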
## Plot the correlation between pH and quality

## Convert quality to numeric
rw2$quality <- as.numeric(rw2$quality)

str(rw2)
## 'data.frame':    1597 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : num  5 5 5 6 5 5 5 7 7 5 ...
ggplot(aes(x = pH, y = quality), data = rw2) + 
  geom_point(aes(color = quality), alpha = 1/2) +
  stat_smooth(method = 'lm')

## In the plot above, pH has a negative effect on red wine quality
# Plot the relationship between total.sulfur.dioxide and quality
ggplot(aes(x = total.sulfur.dioxide, y = quality), data = rw2) + 
  geom_point(color = I('#F79420'), alpha = 1/4) +
  stat_smooth(method = 'lm')

## total.sulfur.dioxide has a negative effect on quality
ggplot(aes(x = residual.sugar, y = quality), data = rw2) + 
  geom_point(color = I('#F79420'), alpha = 1/4) +
  stat_smooth(method = 'lm')

## residual.sugar appears to have no effect on wine quality
ggplot(aes(x = chlorides, y = quality), data = rw2) + 
  geom_point(color = I('#F79420'), alpha = 1/4) +
  stat_smooth(method = 'lm')

## In the plot above, chlorides has a negative effect on wine quality
ggplot(aes(x = alcohol, y = quality), data = rw2) + 
  geom_point(color = I('#F79420'), alpha = 1/4) +
  stat_smooth(method = 'lm')

## In the plot above, alcohol has a strong positive effect on wine quality
ggplot(aes(x = sulphates, y = quality), data = rw2) + 
  geom_point(color = I('red'), alpha = 1/2) +
  stat_smooth(method = 'lm')

## In the plot above, sulphates has a positive effect on wine quality
ggplot(aes(x = volatile.acidity, y = quality), data = rw2) +
  geom_point() +
  scale_x_log10(breaks=seq(.1,1,.1)) +
  xlab("log10(volatile.acidity)") +
  geom_smooth(method="lm")

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The plots above show that:

  1. chlorides, total.sulfur.dioxide and pH have a negative effect on red wine quality;
  2. sulphates and alcohol have a positive effect on red wine quality;
  3. residual.sugar appears to have no effect on wine quality;
  4. free.sulfur.dioxide is related to total.sulfur.dioxide, so only one of them (for example total.sulfur.dioxide) should be considered in further analysis.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

## Plot the effect of citric.acid on pH

## In small quantities, citric acid can add 'freshness' and flavor to wines. Only rows with citric.acid > 0 are kept, because zero cannot be shown on a log10 axis.
ggplot(data = subset(rw2, citric.acid > 0), aes(x = citric.acid, y = pH)) +
  geom_point() +
  scale_x_log10() +
  xlab("log10(citric.acid)") +
  geom_smooth(method="lm")

Based on the plot above, pH falls (the wine becomes more acidic) as citric.acid increases, which matches what we would expect.

What was the strongest relationship you found?

## Plot the effect of fixed.acidity on pH
ggplot(aes(x = fixed.acidity, y = pH), data = rw2) +
  geom_point() +
  #scale_x_log10() +
  xlab("fixed.acidity") +
  geom_smooth(method="lm")

We can see that fixed.acidity has the strongest relationship with pH.
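
The direction of this relationship can be checked with a simple linear fit. The sketch below uses only the first ten rows shown in the `str(rw)` output, so the fitted numbers are illustrative and not estimates for the full dataset. The variable names are chosen for this example.

```r
# First ten rows of each column, copied from the str(rw) output
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
ph <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# Ordinary least squares fit of pH on fixed acidity
fit <- lm(ph ~ fixed_acidity)

# Slope of the fitted line and the matching correlation coefficient
slope <- coef(fit)[["fixed_acidity"]]
r <- cor(fixed_acidity, ph)
```

On the full data the equivalent call would be `lm(pH ~ fixed.acidity, data = rw2)`.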

Multivariate Plots Section

## Plot sulphates, alcohol and quality relations

ggplot(aes(y = sulphates, x = alcohol,
           color = quality), data = rw2) + 
  geom_line() +
  # limit the sulphates axis to the range 0.3 to 1.2
  scale_y_continuous(limits=c(0.3, 1.2)) +
  facet_wrap(~quality)

The bivariate section showed that sulphates and alcohol both have a positive effect on wine quality. The plots above also show that high-quality wines tend to have high alcohol and high sulphates.

## Plot chlorides, residual.sugar and quality relations
ggplot(aes(x = chlorides, y = residual.sugar), data = rw2) + 
  geom_point(size = 3, shape = 1) +
  scale_x_continuous(limits=c(0.05, 0.2)) +
  scale_y_continuous(limits=c(1, 8)) +
  facet_wrap(~quality) 
## Warning: Removed 106 rows containing missing values (geom_point).

## No obvious effect was found for these features

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From the plots above, better-quality red wines tend to have high alcohol and high sulphate concentrations.

Were there any interesting or surprising interactions between features?

In the faceted plot, neither salt (chlorides) nor residual.sugar shows an obvious effect on wine quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

## Use 80% of the rows as training data and the rest as test data
## (sample_frac() comes from dplyr)
training_data <- sample_frac(rw2, .8)

test_data <- rw2[!rw2$X %in% training_data$X, ]
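
Because the split is random, the model table changes from run to run unless a seed is fixed. A minimal base R sketch of a repeatable 80/20 split follows. It uses a hypothetical `toy` data frame; the seed value and all variable names are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed, added here only for repeatability

# Hypothetical data frame with an index column like X in the wine data
toy <- data.frame(X = 1:100, y = rnorm(100))

# Draw 80% of the row indices without replacement
train_idx <- sample(nrow(toy), size = 0.8 * nrow(toy))

# Training rows are the sampled ones; test rows are everything else
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
```

Calling `set.seed()` before `sample_frac()` should make the dplyr version repeatable in the same way.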

## Build nested linear models (mtable() comes from the memisc package)
m1 <- lm(quality ~ alcohol, data = training_data)
m2 <- update(m1, ~ . + sulphates)
m3 <- update(m2, ~ . + total.sulfur.dioxide)
m4 <- update(m3, ~ . + chlorides)
m5 <- update(m4, ~ . + pH)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = training_data)
## m2: lm(formula = quality ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = quality ~ alcohol + sulphates + total.sulfur.dioxide, 
##     data = training_data)
## m4: lm(formula = quality ~ alcohol + sulphates + total.sulfur.dioxide + 
##     chlorides, data = training_data)
## m5: lm(formula = quality ~ alcohol + sulphates + total.sulfur.dioxide + 
##     chlorides + pH, data = training_data)
## 
## ==============================================================================================
##                              m1            m2            m3            m4            m5       
## ----------------------------------------------------------------------------------------------
##   (Intercept)               1.854***      1.416***      1.712***      2.037***      4.217***  
##                            (0.194)       (0.198)       (0.208)       (0.214)       (0.459)    
##   alcohol                   0.363***      0.352***      0.333***      0.304***      0.322***  
##                            (0.019)       (0.018)       (0.019)       (0.019)       (0.019)    
##   sulphates                               0.835***      0.881***      1.164***      1.060***  
##                                          (0.110)       (0.110)       (0.121)       (0.121)    
##   total.sulfur.dioxide                                 -0.003***     -0.003***     -0.003***  
##                                                        (0.001)       (0.001)       (0.001)    
##   chlorides                                                          -2.343***     -2.718***  
##                                                                      (0.435)       (0.436)    
##   pH                                                                               -0.687***  
##                                                                                    (0.128)    
## ----------------------------------------------------------------------------------------------
##   R-squared                 0.231         0.264         0.275         0.292         0.307     
##   adj. R-squared            0.231         0.263         0.274         0.289         0.304     
##   sigma                     0.706         0.691         0.686         0.679         0.671     
##   F                       384.099       229.215       161.421       130.962       112.801     
##   p                         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -1367.451     -1339.329     -1329.738     -1315.365     -1301.088     
##   Deviance                635.976       608.594       599.528       586.193       573.241     
##   AIC                    2740.901      2686.658      2669.476      2642.730      2616.176     
##   BIC                    2756.361      2707.270      2695.242      2673.648      2652.247     
##   N                      1278          1278          1278          1278          1278         
## ==============================================================================================
## Calculate the prediction error on the test data
df <- data.frame(
  test_data$quality,
  predict(m5, test_data) - test_data$quality)

names(df) <- c("quality", "error")

## group = quality draws one box per quality score
ggplot(aes(x = quality, y = error, group = quality), data = df) +
  geom_boxplot()

## Overall the errors are close to 0, which suggests that the features above carry information about red wine quality
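
Errors can average out near zero even when individual predictions are far off, so summary metrics such as RMSE and MAE are a useful complement to the box plot. The sketch below computes both on simulated data. The `toy` data frame, its coefficients and the seed are hypothetical and unrelated to the wine models.

```r
set.seed(7)  # arbitrary seed so the sketch is repeatable

# Hypothetical data: one predictor and a noisy linear response
toy <- data.frame(x = runif(200, 8, 15))
toy$y <- 2 + 0.35 * toy$x + rnorm(200, sd = 0.7)

# Simple split: first 160 rows for training, last 40 for testing
train <- toy[1:160, ]
test  <- toy[161:200, ]

# Fit on the training rows and compute errors on the held-out rows
fit <- lm(y ~ x, data = train)
err <- predict(fit, newdata = test) - test$y

# Root mean squared error and mean absolute error
rmse <- sqrt(mean(err^2))
mae  <- mean(abs(err))
```

For the wine models, the same two lines could be applied to `df$error`. Computing them separately for each quality score would show where the model does worst.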
# Final Plots and Summary
### Plot One

### Description One

In this dataset most quality scores are 5 or 6, and there are too few very bad or very good wines to analyze, so the dataset may not be suitable for judging very good red wine.

### Plot Two

## Warning: Removed 67 rows containing missing values (geom_point).

### Description Two

The plot above shows that sulphates and alcohol both affect red wine quality.

### Plot Three

### Description Three

Because we do not have enough data for bad or good wines (quality < 4 or > 7), the linear regression shows very large errors for them.

Reflection

This project used R to analyze the red wine data and find out which features affect red wine quality:

  1. univariate analysis of single features;
  2. bivariate analysis of how individual features affect wine quality;
  3. linear regression using the top 5 features to predict quality.

R is also very different from Python, and I need more practice to become familiar with the tool. That said, ggplot2 is a very good plotting package, and I find it simpler than matplotlib in Python.